Deep Learning: A Simple Example¶
Let’s get back to the Name Gender Classifier.

Prepare Data¶
import numpy as np
import nltk
from nltk.corpus import names
import random
labeled_names = ([(name, 1) for name in names.words('male.txt')] +
                 [(name, 0) for name in names.words('female.txt')])
random.shuffle(labeled_names)
Train-Test Split¶
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(labeled_names, test_size = 0.2, random_state=42)
print(len(train_set), len(test_set))
6355 1589
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, RNN, GRU
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import SpatialDropout1D
names = [n for (n, l) in train_set]
labels = [l for (n, l) in train_set]
len(names)
6355
Tokenizer¶
By default, the token index 0 is reserved for the padding token.
If `oov_token` is specified, it defaults to index 1.
Specify `num_words` for the tokenizer to include only the top N words in the model.
The Tokenizer automatically removes punctuation.
The Tokenizer uses whitespace as the word delimiter.
To treat every character as a token, specify `char_level=True`.
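A quick sanity check of these defaults, using toy texts made up for illustration (not the name data):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Word-level tokenization (the default): punctuation removed, whitespace-delimited
toy_tok = Tokenizer(oov_token='<OOV>')
toy_tok.fit_on_texts(['Hello, world!', 'hello there'])

# '<OOV>' takes index 1; 'hello' (the most frequent word) comes next
print(toy_tok.word_index)
# Index 0 is never assigned: it is reserved for the padding token.
```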
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(names)
Prepare Input and Output Tensors¶
Like in feature-based machine learning, a computational model only accepts numeric values; it is therefore necessary to convert raw text into numeric tensors for the neural network.
After creating the Tokenizer, we use it to perform text vectorization, i.e., converting texts into tensors.
In deep learning, words or characters are automatically converted into numeric representations.
In other words, the feature engineering step is fully automatic.
Two Ways of Text Vectorization¶
Texts to Sequences: integer encoding of the tokens in each text, from which token embeddings can be learned
Texts to Matrix: one-hot encoding of each text (similar to a bag-of-words model)
Method 1: Text to Sequences¶
From Texts and Sequences¶
Text to Sequences
Padding to uniform lengths for each text
names_ints = tokenizer.texts_to_sequences(names)
print(names[:10])
print(names_ints[:10])
print(labels[:10])
['Bette-Ann', 'Ebeneser', 'Eimile', 'Toby', 'Gavrielle', 'Meris', 'Audrey', 'Hermine', 'Corilla', 'Krystyna']
[[15, 2, 8, 8, 2, 27, 1, 4, 4], [2, 15, 2, 4, 2, 9, 2, 5], [2, 3, 11, 3, 6, 2], [8, 7, 15, 13], [17, 1, 19, 5, 3, 2, 6, 6, 2], [11, 2, 5, 3, 9], [1, 16, 10, 5, 2, 13], [12, 2, 5, 11, 3, 4, 2], [14, 7, 5, 3, 6, 6, 1], [18, 5, 13, 9, 8, 13, 4, 1]]
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Vocabulary¶
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
Vocabulary Size: 30
tokenizer.word_index
{'a': 1,
'e': 2,
'i': 3,
'n': 4,
'r': 5,
'l': 6,
'o': 7,
't': 8,
's': 9,
'd': 10,
'm': 11,
'h': 12,
'y': 13,
'c': 14,
'b': 15,
'u': 16,
'g': 17,
'k': 18,
'v': 19,
'j': 20,
'f': 21,
'p': 22,
'w': 23,
'z': 24,
'x': 25,
'q': 26,
'-': 27,
' ': 28,
"'": 29}
Padding¶
When padding all texts to uniform lengths, consider whether to pad or truncate at the beginning of each sequence (`pre`, the default) or at the end (`post`).
Check the `padding` and `truncating` parameters of `pad_sequences`.
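A small illustration of these two parameters on toy sequences (not the name data):

```python
from tensorflow.keras.preprocessing import sequence

toy_seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]

# Default behavior: pad AND truncate at the beginning ('pre')
padded_pre = sequence.pad_sequences(toy_seqs, maxlen=4)
print(padded_pre.tolist())   # [[0, 1, 2, 3], [5, 6, 7, 8]]

# Pad and truncate at the end instead
padded_post = sequence.pad_sequences(toy_seqs, maxlen=4,
                                     padding='post', truncating='post')
print(padded_post.tolist())  # [[1, 2, 3, 0], [4, 5, 6, 7]]
```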
names_lens=[len(n) for n in names_ints]
names_lens
import seaborn as sns
sns.displot(names_lens)
print(names[np.argmax(names_lens)]) # longest name
Helen-Elizabeth
max_len = names_lens[np.argmax(names_lens)]
max_len
15
names_ints_pad = sequence.pad_sequences(names_ints, maxlen = max_len)
names_ints_pad[:10]
array([[ 0, 0, 0, 0, 0, 0, 15, 2, 8, 8, 2, 27, 1, 4, 4],
[ 0, 0, 0, 0, 0, 0, 0, 2, 15, 2, 4, 2, 9, 2, 5],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 11, 3, 6, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 7, 15, 13],
[ 0, 0, 0, 0, 0, 0, 17, 1, 19, 5, 3, 2, 6, 6, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 2, 5, 3, 9],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 16, 10, 5, 2, 13],
[ 0, 0, 0, 0, 0, 0, 0, 0, 12, 2, 5, 11, 3, 4, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 14, 7, 5, 3, 6, 6, 1],
[ 0, 0, 0, 0, 0, 0, 0, 18, 5, 13, 9, 8, 13, 4, 1]],
dtype=int32)
Define X and Y¶
X_train = np.array(names_ints_pad).astype('int32')
y_train = np.array(labels)
X_test = np.array(sequence.pad_sequences(
tokenizer.texts_to_sequences([n for (n,l) in test_set]),
maxlen = max_len)).astype('int32')
y_test = np.array([l for (n,l) in test_set])
X_test_texts = [n for (n,l) in test_set]
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(6355, 15)
(6355,)
(1589, 15)
(1589,)
Method 2: Text to Matrix¶
One-Hot Encoding¶
Text to Matrix (to create a bag-of-words representation of each text)
Choose a mode: binary, count, or tfidf
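The modes differ in the cell values: `binary` records presence, `count` records token frequency, and `tfidf` weights frequency by document rarity. A small sketch with a toy character-level tokenizer (the names here are made up):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

toy_tok = Tokenizer(char_level=True)
toy_tok.fit_on_texts(['anna', 'bob'])

# Column 0 is unused (reserved for padding); the other columns follow toy_tok.word_index
m_binary = toy_tok.texts_to_matrix(['anna'], mode='binary')
m_count = toy_tok.texts_to_matrix(['anna'], mode='count')
m_tfidf = toy_tok.texts_to_matrix(['anna'], mode='tfidf')
print(m_binary.tolist())  # 'a' and 'n' are present
print(m_count.tolist())   # 'a' and 'n' each occur twice
print(m_tfidf.tolist())   # counts re-weighted by how rare each character is
```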
names_matrix = tokenizer.texts_to_matrix(names, mode="binary")
names[2]
'Eimile'
`names_matrix` is in fact a bag-of-characters representation of each name.
import pandas as pd
pd.DataFrame(names_matrix[2,1:],
columns=["ONE-HOT"],
index=list(tokenizer.word_index.keys()))
| | ONE-HOT |
|---|---|
| a | 0.0 |
| e | 1.0 |
| i | 1.0 |
| n | 0.0 |
| r | 0.0 |
| l | 1.0 |
| o | 0.0 |
| t | 0.0 |
| s | 0.0 |
| d | 0.0 |
| m | 1.0 |
| h | 0.0 |
| y | 0.0 |
| c | 0.0 |
| b | 0.0 |
| u | 0.0 |
| g | 0.0 |
| k | 0.0 |
| v | 0.0 |
| j | 0.0 |
| f | 0.0 |
| p | 0.0 |
| w | 0.0 |
| z | 0.0 |
| x | 0.0 |
| q | 0.0 |
| - | 0.0 |
| (space) | 0.0 |
| ' | 0.0 |
Define X and Y¶
X_train2 = np.array(names_matrix).astype('int32')
y_train2 = np.array(labels)
X_test2 = tokenizer.texts_to_matrix([n for (n,l) in test_set], mode="binary").astype('int32')
y_test2 = np.array([l for (n,l) in test_set])
X_test2_texts = [n for (n,l) in test_set]
print(X_train2.shape)
print(y_train2.shape)
print(X_test2.shape)
print(y_test2.shape)
(6355, 30)
(6355,)
(1589, 30)
(1589,)
Model Definition¶
After we have defined our input and output tensors (X and y), we can define the architecture of our neural network model.
For the two vectorized representations of the names, we try two different network architectures.
Text to Sequences: Embedding + RNN
Text to Matrix: Fully connected Dense Layers
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
# Plotting results
def plot1(history):
    matplotlib.rcParams['figure.dpi'] = 100
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc)+1)
    ## Accuracy plot
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    ## Loss plot
    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()

def plot2(history):
    pd.DataFrame(history.history).plot(figsize=(8,5))
    plt.grid(True)
    #plt.gca().set_ylim(0,1)
    plt.show()
Model 1: Fully Connected Dense Layers¶
Two fully-connected dense layers with the Text-to-Matrix inputs

from tensorflow.keras import layers
model1 = keras.Sequential()
model1.add(keras.Input(shape=(vocab_size,), name="one_hot_input"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_1"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_2"))
model1.add(layers.Dense(1, activation="sigmoid", name="output"))
model1.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model1, show_shapes=True)
A few hyperparameters for network training¶
Batch size
Number of epochs
Validation split ratio
BATCH_SIZE=128
EPOCHS=20
VALIDATION_SPLIT=0.2
history1 = model1.fit(X_train2, y_train2,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 2s - loss: 0.6472 - accuracy: 0.6334 - val_loss: 0.6420 - val_accuracy: 0.6255
Epoch 2/20
40/40 - 0s - loss: 0.6316 - accuracy: 0.6341 - val_loss: 0.6317 - val_accuracy: 0.6294
Epoch 3/20
40/40 - 0s - loss: 0.6224 - accuracy: 0.6418 - val_loss: 0.6242 - val_accuracy: 0.6672
Epoch 4/20
40/40 - 0s - loss: 0.6134 - accuracy: 0.6593 - val_loss: 0.6178 - val_accuracy: 0.6766
Epoch 5/20
40/40 - 0s - loss: 0.6054 - accuracy: 0.6743 - val_loss: 0.6111 - val_accuracy: 0.6806
Epoch 6/20
40/40 - 0s - loss: 0.5977 - accuracy: 0.6804 - val_loss: 0.6050 - val_accuracy: 0.6845
Epoch 7/20
40/40 - 0s - loss: 0.5906 - accuracy: 0.6902 - val_loss: 0.5991 - val_accuracy: 0.6900
Epoch 8/20
40/40 - 0s - loss: 0.5844 - accuracy: 0.6932 - val_loss: 0.5944 - val_accuracy: 0.6947
Epoch 9/20
40/40 - 0s - loss: 0.5786 - accuracy: 0.6985 - val_loss: 0.5907 - val_accuracy: 0.6939
Epoch 10/20
40/40 - 0s - loss: 0.5735 - accuracy: 0.7028 - val_loss: 0.5869 - val_accuracy: 0.6924
Epoch 11/20
40/40 - 0s - loss: 0.5696 - accuracy: 0.7036 - val_loss: 0.5844 - val_accuracy: 0.6916
Epoch 12/20
40/40 - 0s - loss: 0.5661 - accuracy: 0.7085 - val_loss: 0.5820 - val_accuracy: 0.6987
Epoch 13/20
40/40 - 0s - loss: 0.5631 - accuracy: 0.7109 - val_loss: 0.5794 - val_accuracy: 0.6900
Epoch 14/20
40/40 - 0s - loss: 0.5604 - accuracy: 0.7126 - val_loss: 0.5774 - val_accuracy: 0.6987
Epoch 15/20
40/40 - 0s - loss: 0.5578 - accuracy: 0.7162 - val_loss: 0.5753 - val_accuracy: 0.6987
Epoch 16/20
40/40 - 0s - loss: 0.5554 - accuracy: 0.7173 - val_loss: 0.5746 - val_accuracy: 0.7050
Epoch 17/20
40/40 - 0s - loss: 0.5533 - accuracy: 0.7203 - val_loss: 0.5727 - val_accuracy: 0.7089
Epoch 18/20
40/40 - 0s - loss: 0.5519 - accuracy: 0.7217 - val_loss: 0.5710 - val_accuracy: 0.7105
Epoch 19/20
40/40 - 0s - loss: 0.5496 - accuracy: 0.7236 - val_loss: 0.5696 - val_accuracy: 0.7113
Epoch 20/20
40/40 - 0s - loss: 0.5484 - accuracy: 0.7248 - val_loss: 0.5687 - val_accuracy: 0.7120
plot2(history1)
model1.evaluate(X_test2, y_test2, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.5786 - accuracy: 0.7067
[0.5785751938819885, 0.7067338228225708]
Model 2: Embedding + RNN¶
One Embedding Layer + One RNN Layer
With Text-to-Sequence inputs

EMBEDDING_DIM = 128
model2 = Sequential()
model2.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model2.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer"))
model2.add(Dense(16, activation="relu", name="dense_layer"))
model2.add(Dense(1, activation="sigmoid", name="output"))
model2.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model2, show_shapes=True)
history2 = model2.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 2s - loss: 0.6051 - accuracy: 0.6357 - val_loss: 0.5358 - val_accuracy: 0.6483
Epoch 2/20
40/40 - 0s - loss: 0.4896 - accuracy: 0.7433 - val_loss: 0.4548 - val_accuracy: 0.7946
Epoch 3/20
40/40 - 0s - loss: 0.4399 - accuracy: 0.7878 - val_loss: 0.4264 - val_accuracy: 0.8017
Epoch 4/20
40/40 - 0s - loss: 0.4208 - accuracy: 0.7988 - val_loss: 0.4188 - val_accuracy: 0.8041
Epoch 5/20
40/40 - 0s - loss: 0.4088 - accuracy: 0.8084 - val_loss: 0.4108 - val_accuracy: 0.8175
Epoch 6/20
40/40 - 0s - loss: 0.3995 - accuracy: 0.8129 - val_loss: 0.4081 - val_accuracy: 0.8033
Epoch 7/20
40/40 - 0s - loss: 0.3937 - accuracy: 0.8153 - val_loss: 0.4055 - val_accuracy: 0.8104
Epoch 8/20
40/40 - 0s - loss: 0.3893 - accuracy: 0.8157 - val_loss: 0.4069 - val_accuracy: 0.8072
Epoch 9/20
40/40 - 0s - loss: 0.3838 - accuracy: 0.8190 - val_loss: 0.4063 - val_accuracy: 0.8096
Epoch 10/20
40/40 - 0s - loss: 0.3805 - accuracy: 0.8200 - val_loss: 0.4066 - val_accuracy: 0.8112
Epoch 11/20
40/40 - 0s - loss: 0.3748 - accuracy: 0.8255 - val_loss: 0.4042 - val_accuracy: 0.8175
Epoch 12/20
40/40 - 0s - loss: 0.3744 - accuracy: 0.8275 - val_loss: 0.4069 - val_accuracy: 0.8112
Epoch 13/20
40/40 - 0s - loss: 0.3693 - accuracy: 0.8350 - val_loss: 0.4064 - val_accuracy: 0.8127
Epoch 14/20
40/40 - 0s - loss: 0.3674 - accuracy: 0.8287 - val_loss: 0.4053 - val_accuracy: 0.8167
Epoch 15/20
40/40 - 0s - loss: 0.3668 - accuracy: 0.8322 - val_loss: 0.4060 - val_accuracy: 0.8135
Epoch 16/20
40/40 - 0s - loss: 0.3623 - accuracy: 0.8326 - val_loss: 0.4061 - val_accuracy: 0.8112
Epoch 17/20
40/40 - 0s - loss: 0.3616 - accuracy: 0.8336 - val_loss: 0.4066 - val_accuracy: 0.8120
Epoch 18/20
40/40 - 0s - loss: 0.3595 - accuracy: 0.8346 - val_loss: 0.4091 - val_accuracy: 0.8072
Epoch 19/20
40/40 - 0s - loss: 0.3580 - accuracy: 0.8365 - val_loss: 0.4104 - val_accuracy: 0.8096
Epoch 20/20
40/40 - 0s - loss: 0.3529 - accuracy: 0.8403 - val_loss: 0.4090 - val_accuracy: 0.8143
plot2(history2)
model2.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4067 - accuracy: 0.8093
[0.40674683451652527, 0.8093140125274658]
Model 3: Regularization and Dropout¶
The previous two examples clearly show overfitting: the training performance keeps improving while the performance on the validation set stalls after the first few epochs.
We can add regularization and dropout to our network definition to mitigate overfitting.
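Dropout is passed directly to the recurrent layer in the model definition below; weight regularization, by contrast, is attached per layer via `kernel_regularizer`. A minimal sketch (the input width 30 and penalty strength 0.001 are illustrative choices, not values tuned in this notebook):

```python
import tensorflow.keras as keras
from tensorflow.keras import layers, regularizers

reg_model = keras.Sequential([
    keras.Input(shape=(30,)),
    # L2 penalty on this layer's weights is added to the training loss
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    # Randomly zero 20% of the activations (during training only)
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
reg_model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
```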
EMBEDDING_DIM = 128
model3 = Sequential()
model3.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model3.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer", dropout=0.2, recurrent_dropout=0.2))
model3.add(Dense(16, activation="relu", name="dense_layer"))
model3.add(Dense(1, activation="sigmoid", name="output"))
model3.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model3)
history3 = model3.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 2s - loss: 0.6245 - accuracy: 0.6442 - val_loss: 0.5580 - val_accuracy: 0.6593
Epoch 2/20
40/40 - 0s - loss: 0.5194 - accuracy: 0.7292 - val_loss: 0.4733 - val_accuracy: 0.7773
Epoch 3/20
40/40 - 0s - loss: 0.4694 - accuracy: 0.7679 - val_loss: 0.4456 - val_accuracy: 0.7876
Epoch 4/20
40/40 - 0s - loss: 0.4485 - accuracy: 0.7805 - val_loss: 0.4303 - val_accuracy: 0.7970
Epoch 5/20
40/40 - 0s - loss: 0.4497 - accuracy: 0.7769 - val_loss: 0.4278 - val_accuracy: 0.8033
Epoch 6/20
40/40 - 0s - loss: 0.4388 - accuracy: 0.7852 - val_loss: 0.4223 - val_accuracy: 0.7970
Epoch 7/20
40/40 - 0s - loss: 0.4366 - accuracy: 0.7862 - val_loss: 0.4195 - val_accuracy: 0.8017
Epoch 8/20
40/40 - 0s - loss: 0.4333 - accuracy: 0.7846 - val_loss: 0.4231 - val_accuracy: 0.7962
Epoch 9/20
40/40 - 0s - loss: 0.4294 - accuracy: 0.7852 - val_loss: 0.4178 - val_accuracy: 0.7994
Epoch 10/20
40/40 - 0s - loss: 0.4307 - accuracy: 0.7897 - val_loss: 0.4135 - val_accuracy: 0.8009
Epoch 11/20
40/40 - 0s - loss: 0.4284 - accuracy: 0.7913 - val_loss: 0.4114 - val_accuracy: 0.8009
Epoch 12/20
40/40 - 0s - loss: 0.4283 - accuracy: 0.7899 - val_loss: 0.4121 - val_accuracy: 0.7970
Epoch 13/20
40/40 - 0s - loss: 0.4231 - accuracy: 0.7923 - val_loss: 0.4110 - val_accuracy: 0.7978
Epoch 14/20
40/40 - 0s - loss: 0.4226 - accuracy: 0.7919 - val_loss: 0.4145 - val_accuracy: 0.7986
Epoch 15/20
40/40 - 0s - loss: 0.4247 - accuracy: 0.7905 - val_loss: 0.4088 - val_accuracy: 0.8041
Epoch 16/20
40/40 - 0s - loss: 0.4261 - accuracy: 0.7899 - val_loss: 0.4142 - val_accuracy: 0.7946
Epoch 17/20
40/40 - 0s - loss: 0.4244 - accuracy: 0.7921 - val_loss: 0.4111 - val_accuracy: 0.7986
Epoch 18/20
40/40 - 0s - loss: 0.4179 - accuracy: 0.7952 - val_loss: 0.4108 - val_accuracy: 0.7978
Epoch 19/20
40/40 - 0s - loss: 0.4205 - accuracy: 0.7958 - val_loss: 0.4065 - val_accuracy: 0.7954
Epoch 20/20
40/40 - 0s - loss: 0.4177 - accuracy: 0.7950 - val_loss: 0.4039 - val_accuracy: 0.8009
plot2(history3)
model3.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4232 - accuracy: 0.7904
[0.4231923818588257, 0.7904342412948608]
Model 4: Improve the Models¶
In addition to regularization and dropout, we can further improve the model by increasing its capacity.
In particular, we can increase the depth and width of the network layers.
Let’s try stacking two RNN layers.
EMBEDDING_DIM = 128
model4 = Sequential()
model4.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_1",
dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_2",
dropout=0.2, recurrent_dropout=0.2))
model4.add(Dense(1, activation="sigmoid", name="output"))
model4.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model4)
history4 = model4.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 3s - loss: 0.6232 - accuracy: 0.6339 - val_loss: 0.5616 - val_accuracy: 0.6381
Epoch 2/20
40/40 - 0s - loss: 0.5431 - accuracy: 0.6796 - val_loss: 0.5015 - val_accuracy: 0.7553
Epoch 3/20
40/40 - 1s - loss: 0.5052 - accuracy: 0.7347 - val_loss: 0.4751 - val_accuracy: 0.7640
Epoch 4/20
40/40 - 0s - loss: 0.4844 - accuracy: 0.7606 - val_loss: 0.4592 - val_accuracy: 0.7797
Epoch 5/20
40/40 - 0s - loss: 0.4755 - accuracy: 0.7673 - val_loss: 0.4461 - val_accuracy: 0.7891
Epoch 6/20
40/40 - 0s - loss: 0.4626 - accuracy: 0.7722 - val_loss: 0.4347 - val_accuracy: 0.7923
Epoch 7/20
40/40 - 0s - loss: 0.4550 - accuracy: 0.7785 - val_loss: 0.4313 - val_accuracy: 0.7939
Epoch 8/20
40/40 - 0s - loss: 0.4579 - accuracy: 0.7771 - val_loss: 0.4308 - val_accuracy: 0.7923
Epoch 9/20
40/40 - 0s - loss: 0.4464 - accuracy: 0.7807 - val_loss: 0.4216 - val_accuracy: 0.8025
Epoch 10/20
40/40 - 0s - loss: 0.4469 - accuracy: 0.7781 - val_loss: 0.4219 - val_accuracy: 0.7923
Epoch 11/20
40/40 - 0s - loss: 0.4441 - accuracy: 0.7830 - val_loss: 0.4175 - val_accuracy: 0.7923
Epoch 12/20
40/40 - 0s - loss: 0.4368 - accuracy: 0.7897 - val_loss: 0.4162 - val_accuracy: 0.7978
Epoch 13/20
40/40 - 0s - loss: 0.4390 - accuracy: 0.7864 - val_loss: 0.4145 - val_accuracy: 0.8025
Epoch 14/20
40/40 - 0s - loss: 0.4365 - accuracy: 0.7899 - val_loss: 0.4134 - val_accuracy: 0.8002
Epoch 15/20
40/40 - 0s - loss: 0.4329 - accuracy: 0.7927 - val_loss: 0.4105 - val_accuracy: 0.8017
Epoch 16/20
40/40 - 0s - loss: 0.4283 - accuracy: 0.7948 - val_loss: 0.4096 - val_accuracy: 0.8033
Epoch 17/20
40/40 - 0s - loss: 0.4246 - accuracy: 0.7941 - val_loss: 0.4087 - val_accuracy: 0.8041
Epoch 18/20
40/40 - 0s - loss: 0.4276 - accuracy: 0.7948 - val_loss: 0.4077 - val_accuracy: 0.8041
Epoch 19/20
40/40 - 0s - loss: 0.4218 - accuracy: 0.7976 - val_loss: 0.4126 - val_accuracy: 0.8049
Epoch 20/20
40/40 - 0s - loss: 0.4297 - accuracy: 0.7978 - val_loss: 0.4103 - val_accuracy: 0.8065
plot2(history4)
model4.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4212 - accuracy: 0.8068
[0.42124617099761963, 0.8067967295646667]
Model 5: Bidirectional¶
Now let’s try a more sophisticated RNN variant, the LSTM, with bidirectional processing, and add more nodes to the LSTM layer.
EMBEDDING_DIM = 128
model5 = Sequential()
model5.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model5.add(layers.Bidirectional(LSTM(32, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model5.add(Dense(1, activation="sigmoid", name="output"))
model5.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model5)
history5 = model5.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 8s - loss: 0.6625 - accuracy: 0.6184 - val_loss: 0.6262 - val_accuracy: 0.6255
Epoch 2/20
40/40 - 1s - loss: 0.5757 - accuracy: 0.6762 - val_loss: 0.4984 - val_accuracy: 0.7569
Epoch 3/20
40/40 - 1s - loss: 0.4699 - accuracy: 0.7764 - val_loss: 0.4438 - val_accuracy: 0.7939
Epoch 4/20
40/40 - 2s - loss: 0.4414 - accuracy: 0.7870 - val_loss: 0.4359 - val_accuracy: 0.7907
Epoch 5/20
40/40 - 2s - loss: 0.4330 - accuracy: 0.7903 - val_loss: 0.4289 - val_accuracy: 0.7970
Epoch 6/20
40/40 - 1s - loss: 0.4263 - accuracy: 0.7980 - val_loss: 0.4218 - val_accuracy: 0.8088
Epoch 7/20
40/40 - 1s - loss: 0.4208 - accuracy: 0.7994 - val_loss: 0.4177 - val_accuracy: 0.8120
Epoch 8/20
40/40 - 1s - loss: 0.4124 - accuracy: 0.8007 - val_loss: 0.4131 - val_accuracy: 0.8096
Epoch 9/20
40/40 - 1s - loss: 0.4081 - accuracy: 0.8061 - val_loss: 0.4099 - val_accuracy: 0.8120
Epoch 10/20
40/40 - 1s - loss: 0.4050 - accuracy: 0.8082 - val_loss: 0.4100 - val_accuracy: 0.8127
Epoch 11/20
40/40 - 1s - loss: 0.4045 - accuracy: 0.8061 - val_loss: 0.4081 - val_accuracy: 0.8120
Epoch 12/20
40/40 - 2s - loss: 0.4016 - accuracy: 0.8078 - val_loss: 0.4049 - val_accuracy: 0.8127
Epoch 13/20
40/40 - 1s - loss: 0.3993 - accuracy: 0.8114 - val_loss: 0.4020 - val_accuracy: 0.8120
Epoch 14/20
40/40 - 2s - loss: 0.3976 - accuracy: 0.8074 - val_loss: 0.4020 - val_accuracy: 0.8167
Epoch 15/20
40/40 - 2s - loss: 0.3943 - accuracy: 0.8100 - val_loss: 0.4013 - val_accuracy: 0.8143
Epoch 16/20
40/40 - 2s - loss: 0.3884 - accuracy: 0.8129 - val_loss: 0.3985 - val_accuracy: 0.8143
Epoch 17/20
40/40 - 2s - loss: 0.3861 - accuracy: 0.8163 - val_loss: 0.3994 - val_accuracy: 0.8151
Epoch 18/20
40/40 - 2s - loss: 0.3878 - accuracy: 0.8145 - val_loss: 0.3971 - val_accuracy: 0.8080
Epoch 19/20
40/40 - 1s - loss: 0.3829 - accuracy: 0.8167 - val_loss: 0.3909 - val_accuracy: 0.8190
Epoch 20/20
40/40 - 1s - loss: 0.3797 - accuracy: 0.8181 - val_loss: 0.3903 - val_accuracy: 0.8167
plot2(history5)
model5.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4044 - accuracy: 0.7992
[0.40442442893981934, 0.7992448210716248]
Check Embeddings¶
Compared to one-hot encodings of characters, embeddings may carry more information about the characteristics of the characters.
We can extract the embedding layer and apply a dimensionality reduction technique (e.g., t-SNE) to see how the embeddings capture the relationships between characters.
X_test[10]
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 12, 1, 5, 7, 4],
dtype=int32)
ind2char = tokenizer.index_word
[ind2char.get(i) for i in X_test[10] if ind2char.get(i) is not None]
['a', 'h', 'a', 'r', 'o', 'n']
tokenizer.texts_to_sequences('Alvin')
[[1], [6], [19], [3], [4]]
char_vectors = model5.layers[0].get_weights()[0]
char_vectors.shape
(30, 128)
labels = [char for (ind, char) in tokenizer.index_word.items()]
labels.insert(0,None)
labels
[None,
'a',
'e',
'i',
'n',
'r',
'l',
'o',
't',
's',
'd',
'm',
'h',
'y',
'c',
'b',
'u',
'g',
'k',
'v',
'j',
'f',
'p',
'w',
'z',
'x',
'q',
'-',
' ',
"'"]
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=3)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(char_vectors)
plt.figure(figsize=(10, 7), dpi=150)
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
Issues of Word/Character Representations¶
One-hot encoding does not indicate semantic relationships between characters.
For deep learning NLP, it is preferable to convert one-hot encodings of words/characters into embeddings, which are argued to capture more semantic information about the tokens.
Now the question is how to train and create better word embeddings. We will come back to this issue later.
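To make the contrast concrete, compare cosine similarities: one-hot vectors of distinct tokens are always orthogonal, whereas embedding vectors can place related tokens closer together. A toy sketch (the vector values are illustrative, not trained weights):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot: every pair of distinct characters is equally unrelated
a_onehot, e_onehot = np.array([1, 0, 0]), np.array([0, 1, 0])
print(cosine(a_onehot, e_onehot))  # 0.0

# Toy embeddings: the two vowels end up closer to each other than to 'x'
a_emb = np.array([0.9, 0.1, 0.0])
e_emb = np.array([0.8, 0.2, 0.1])
x_emb = np.array([0.0, 0.2, 0.9])
print(cosine(a_emb, e_emb) > cosine(a_emb, x_emb))  # True
```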
Hyperparameter Tuning¶
Note
Please install the keras-tuner module in your current conda environment:
pip install -U keras-tuner
Like feature-based ML methods, neural networks come with many hyperparameters, whose values need to be chosen before training.
Typical hyperparameters include:
Number of nodes in a layer
Learning rate
We can utilize the `keras-tuner` module to fine-tune the hyperparameters.
Steps for Keras Tuner
First, wrap the model definition in a function that takes a single `hp` argument. Inside this function, replace any value we want to tune with a call to a hyperparameter sampling method, e.g. `hp.Int()` or `hp.Choice()`. The function should return a compiled model.
Next, instantiate a tuner object, specifying the optimization objective and other search parameters.
Finally, start the search with the `search()` method, which takes the same arguments as `Model.fit()` in Keras.
When the search is over, we can retrieve the best models and a summary of the results from the tuner.
import kerastuner
## Wrap model definition in a function
## and specify the parameters needed for tuning
# def build_model(hp):
# model1 = keras.Sequential()
# model1.add(keras.Input(shape=(max_len,)))
# model1.add(layers.Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_1"))
# model1.add(layers.Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_2"))
# model1.add(layers.Dense(2, activation="softmax", name="output"))
# model1.compile(
# optimizer=keras.optimizers.Adam(
# hp.Choice('learning_rate',
# values=[1e-2, 1e-3, 1e-4])),
# loss='sparse_categorical_crossentropy',
# metrics=['accuracy'])
# return model1
def build_model(hp):
    m = Sequential()
    m.add(Embedding(input_dim=vocab_size,
                    output_dim=hp.Int('output_dim', min_value=32, max_value=128, step=32),
                    input_length=max_len,
                    mask_zero=True))
    m.add(layers.Bidirectional(LSTM(
        hp.Int('units', min_value=16, max_value=64, step=16),
        activation="relu",
        dropout=0.2,
        recurrent_dropout=0.2)))
    m.add(Dense(1, activation="sigmoid", name="output"))
    m.compile(
        loss=keras.losses.BinaryCrossentropy(),
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        metrics=["accuracy"]
    )
    return m
## This is to clean up the temp dir from the tuner
## Every time we re-start the tuner, it's better to keep the temp dir clean
import os
import shutil
if os.path.isdir('my_dir'):
    shutil.rmtree('my_dir')
The `max_trials` argument sets the number of hyperparameter combinations the tuner will test.
The `executions_per_trial` argument sets the number of models to build and fit for each trial, for robustness.
## Instantiate the tuner
tuner = kerastuner.tuners.RandomSearch(
build_model,
objective='val_accuracy',
max_trials=10,
executions_per_trial=2,
directory='my_dir')
## Check the tuner's search space
tuner.search_space_summary()
Search space summary
Default search space size: 2
output_dim (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 128, 'step': 32, 'sampling': None}
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': None}
%%time
## Start tuning with the tuner
tuner.search(X_train, y_train, validation_split=0.2, batch_size=128)
Trial 10 Complete [00h 00m 15s]
val_accuracy: 0.6258851289749146
Best val_accuracy So Far: 0.6266719102859497
Total elapsed time: 00h 02m 32s
INFO:tensorflow:Oracle triggered exit
CPU times: user 3min 9s, sys: 4.89 s, total: 3min 14s
Wall time: 2min 32s
## Retrieve the best models from the tuner
models = tuner.get_best_models(num_models=2)
plot_model(models[0], show_shapes=True)
## Retrieve the summary of results from the tuner
tuner.results_summary()
Results summary
Results in my_dir/untitled_project
Showing 10 best trials
Objective(name='val_accuracy', direction='max')
Trial summary
Hyperparameters:
output_dim: 96
units: 48
Score: 0.6266719102859497
Trial summary
Hyperparameters:
output_dim: 96
units: 64
Score: 0.6262785196304321
Trial summary
Hyperparameters:
output_dim: 96
units: 32
Score: 0.6258851289749146
Trial summary
Hyperparameters:
output_dim: 96
units: 16
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 128
units: 16
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 64
units: 16
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 32
units: 16
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 32
units: 48
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 32
units: 32
Score: 0.625491738319397
Trial summary
Hyperparameters:
output_dim: 64
units: 32
Score: 0.625491738319397
Explanation¶
Train Model with the Tuned Hyperparameters¶
EMBEDDING_DIM = 128
model6 = Sequential()
model6.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model6.add(layers.Bidirectional(LSTM(64, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model6.add(Dense(1, activation="sigmoid", name="output"))
model6.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model6)
history6 = model6.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 7s - loss: 0.6501 - accuracy: 0.6286 - val_loss: 0.6126 - val_accuracy: 0.6302
Epoch 2/20
40/40 - 2s - loss: 0.5519 - accuracy: 0.7087 - val_loss: 0.4853 - val_accuracy: 0.7537
Epoch 3/20
40/40 - 2s - loss: 0.4613 - accuracy: 0.7789 - val_loss: 0.4476 - val_accuracy: 0.7821
Epoch 4/20
40/40 - 2s - loss: 0.4365 - accuracy: 0.7893 - val_loss: 0.4304 - val_accuracy: 0.7994
Epoch 5/20
40/40 - 2s - loss: 0.4269 - accuracy: 0.7986 - val_loss: 0.4269 - val_accuracy: 0.8135
Epoch 6/20
40/40 - 2s - loss: 0.4247 - accuracy: 0.7994 - val_loss: 0.4197 - val_accuracy: 0.8127
Epoch 7/20
40/40 - 2s - loss: 0.4165 - accuracy: 0.8007 - val_loss: 0.4170 - val_accuracy: 0.8127
Epoch 8/20
40/40 - 2s - loss: 0.4112 - accuracy: 0.8063 - val_loss: 0.4175 - val_accuracy: 0.8175
Epoch 9/20
40/40 - 2s - loss: 0.4060 - accuracy: 0.8086 - val_loss: 0.4171 - val_accuracy: 0.8096
Epoch 10/20
40/40 - 2s - loss: 0.4049 - accuracy: 0.8088 - val_loss: 0.4170 - val_accuracy: 0.8143
Epoch 11/20
40/40 - 2s - loss: 0.4000 - accuracy: 0.8133 - val_loss: 0.4076 - val_accuracy: 0.8167
Epoch 12/20
40/40 - 2s - loss: 0.3959 - accuracy: 0.8127 - val_loss: 0.4019 - val_accuracy: 0.8190
Epoch 13/20
40/40 - 2s - loss: 0.3907 - accuracy: 0.8129 - val_loss: 0.4000 - val_accuracy: 0.8175
Epoch 14/20
40/40 - 2s - loss: 0.3900 - accuracy: 0.8129 - val_loss: 0.3968 - val_accuracy: 0.8206
Epoch 15/20
40/40 - 2s - loss: 0.3867 - accuracy: 0.8175 - val_loss: 0.3953 - val_accuracy: 0.8183
Epoch 16/20
40/40 - 2s - loss: 0.3807 - accuracy: 0.8222 - val_loss: 0.3948 - val_accuracy: 0.8143
Epoch 17/20
40/40 - 2s - loss: 0.3769 - accuracy: 0.8220 - val_loss: 0.3883 - val_accuracy: 0.8151
Epoch 18/20
40/40 - 2s - loss: 0.3747 - accuracy: 0.8249 - val_loss: 0.3871 - val_accuracy: 0.8151
Epoch 19/20
40/40 - 2s - loss: 0.3679 - accuracy: 0.8242 - val_loss: 0.3857 - val_accuracy: 0.8190
Epoch 20/20
40/40 - 2s - loss: 0.3649 - accuracy: 0.8265 - val_loss: 0.3858 - val_accuracy: 0.8175
plot2(history6)
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=['male'], char_level=True)
def model_predict_pipeline(text):
    _seq = tokenizer.texts_to_sequences(text)
    _seq_pad = keras.preprocessing.sequence.pad_sequences(_seq, maxlen=max_len)
    #return np.array([[float(1-x), float(x)] for x in model.predict(np.array(_seq_pad))])
    return model6.predict(np.array(_seq_pad))
reversed_word_index = dict([(index, word) for (word, index) in tokenizer.word_index.items()])
text_id =305
X_test[text_id]
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 10, 2, 6, 6],
dtype=int32)
X_test_texts[text_id]
'Odell'
' '.join([reversed_word_index.get(i, '?') for i in X_test[text_id]])
'? ? ? ? ? ? ? ? ? ? o d e l l'
model_predict_pipeline([X_test_texts[text_id]])
array([[0.5448343]], dtype=float32)
exp = explainer.explain_instance(
X_test_texts[text_id], model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
y_test[text_id]
1
exp = explainer.explain_instance(
'Tim', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'Michaelis', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'Sidney', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'Timber', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'Alvin', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
References¶
Chollet (2017), Ch 3 and Ch 4